Regionalisation with Spatially Constrained Cluster Analysis

Published

December 5, 2022

case study : Regionalisation by Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods.

1. OVERVIEW

Regionalisation with Spatially Constrained clustering analysis requires similar observations to be grouped according to their statistical attributes and spatial location.

1.1 Objectives

  • Total number of water points by status, i.e. functional, non-functional, and unknown;

  • Percentage of water points by :

    • status (functional, non-functional, and unknown);

    • deployed water technology (hand pump, mechanical pump, stand tap, etc.) ;

    • usage capacity (1000, 300, 250, 50);

    • rural or urban.

1.2 Study Area Introduction

Alpha-3 Code : NGA

Population : 225 million (1st in Africa, 6th globally)

Local Government Areas (LGA) : 774

Water Point Observations : 95,008

Environmental Aspects :

  • Geography :

    • Southwest - “rugged” highland.

    • Southeast - hills and mountains, which form the Mambilla Plateau, the highest plateau in Nigeria.

  • Hydrology :

    • Two (2) main catchment areas - Chad Basin & Niger catchment area.

    • Surface area of lake Chad is shrinking recent decades due to irrigation activities.1

    • Untreated wastes dump in places resulted in waterways and groundwater pollution.2

  • Vegetation Coverage :

    • Lost nearly 80% of primary forest by 2012.3

    • States with dense forests concentrated : Bayelsa, Cross River, Edo, Ekiti, Ondo, Osun, Rivers, and Taraba.

  • 1 Wikipedia. Nigeria. https://en.wikipedia.org/wiki/Nigeria

  • 2 Ogbonna, D.N., Ekweozor, I.K.E., Igwe, F.U. (2002). “Waste Management: A Tool for Environmental Protection in Nigeria”. Ambio: A Journal of the Human Environment. 31 (1): 55–57. doi:10.1639/0044-7447(2002)031[0055:wmatfe]2.0.co;2.

  • 3 https://rainforests.mongabay.com/20nigeria.htm

  • 1.3 Scope of Works

    • import the shapefile into R with the appropriate sf method, and save it in a simple feature data frame format;
    Note

    note

    Three (3) Projected Coordinate Systems of Nigeria, EPSG : 26391, 26392, and 26303.

    • derive the proportion of functional and non-functional water points at LGA level (i.e. ADM2) by appropriate tidyr and dplyr methods;

    • combine geospatial and aspatial data frames into a simple feature data frame.

    • delineate water points measures functional regions by using :

      • conventional hierarchical clustering.

      • spatially constrained clustering algorithms.

    • plot two (2) main types of maps below :

      Thematic Mapping

      Show the derived water-point measures by appropriate statistical graphics and choropleth mapping technique.

      Analytical Mapping

      Plot delineated functional regions using non-spatially constrained and spatially constrained clustering algorithms.

    2. R PACKAGE REQUIRED

    The following are the packages required for this exercise :

    2.1 Load R Packages

    Usage of the code chunk below :

    p_load( ) - pacman - to load packages. This function will attempt to install the package from CRAN or pacman repository list if its found not installed.

    pacman::p_load(sf, tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr)


    3. GEOSPATIAL DATA

    3.1 Acquire Data Source

    Note

    The file size of the downloaded data is about 422 MB due to water points data from multiple countries.

    • Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.

    Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.

    Therewith, the CSV file size should be lesser than 100 MB.

  • 4 Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866

  • 3.2 Import Attribute Data

    Four (4) data frames to be created for different context, i.e.

    • wp_coord = coordinated related variables.

    • wp_cond = status and conditions related variables.

    • wp_adm = administrative and measurements related variables.

    • wp = master file that combine all three (3) data frames above.

    3.2.4 Create Master File

    Usage of the code chunk below :

    left_join( ) - dplyr - to combine wp_coord, wp_cond and wp_adm.

    wp <- left_join(
      
      (left_join
       (wp_coord,wp_cond,
         by = c("row_id")
         )
       ),
      wp_adm, by = c("row_id"))

    3.2.5 Convert Well Known Text (WKT) Data to SF Data Frame

    • The “New Georeferenced Column” in wp_rds contains spatial data in a WKT format.

    • Two (2) steps to convert the WKT data format into an sf data frame.

    3.2.5.1 derive new field :: “geometry”

    Usage of the code chunk below :

    st_as_sfc( ) - sf - to convert foreign geometry object `New Georeferenced Column` to an sfc object

    wp$geometry = st_as_sfc(wp$`New Georeferenced Column`)

    3.2.5.2 convert to SF Data Frame

    Usage of the code chunk below :

    st_sf( ) - sf - to convert the tibble data frame into sf data frame with crs first set to WGS 84 (EPSG : 4326).

    st_crs( ) - sf - to retrieve coordinate reference system from the object.

    wp_sf<- st_sf(wp, crs = 4326)
    st_crs(wp_sf)
    Coordinate Reference System:
      User input: EPSG:4326 
      wkt:
    GEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        CS[ellipsoidal,2],
            AXIS["geodetic latitude (Lat)",north,
                ORDER[1],
                ANGLEUNIT["degree",0.0174532925199433]],
            AXIS["geodetic longitude (Lon)",east,
                ORDER[2],
                ANGLEUNIT["degree",0.0174532925199433]],
        USAGE[
            SCOPE["Horizontal component of 3D system."],
            AREA["World."],
            BBOX[-90,-180,90,180]],
        ID["EPSG",4326]]

    3.3 Import Boundary Data

    Usage of the code chunk below :

    st_read( ) - sf - to read simple features.

    select( ) - dplyr - to select “shapeName” variable.

    bdy_nga <- st_read(dsn = "data/geospatial",
                   layer = "geoBoundaries-NGA-ADM2",
                   crs = 4326) %>%
      select(shapeName)
    Reading layer `geoBoundaries-NGA-ADM2' from data source 
      `D:\jephOstan\ISSS624\class_project\project_2\data\geospatial' 
      using driver `ESRI Shapefile'
    Simple feature collection with 774 features and 5 fields
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
    Geodetic CRS:  WGS 84
    problems(bdy_nga)

    3.3.1 Review Imported File

    3.3.1.1 check for missing and duplicated data

    skim(bdy_nga)
    Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No
    user-defined `sfl` provided. Falling back to `character`.
    Data summary
    Name bdy_nga
    Number of rows 774
    Number of columns 2
    _______________________
    Column type frequency:
    character 2
    ________________________
    Group variables None

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    shapeName 0 1 3 18 0 768 0
    geometry 0 1 878 33370 0 774 0

    Remarks :

    • There is no missing data.

    • There 774 unique “geometry” but only 768 unique “shapeName”

      • That’s mean there are 6 values of “shapeName” duplicated among the identified unique shapeNames.

    3.3.1.2 list the unique “shapeName” associated with duplication

    Usage of the code chunk below :

    add_count( ) - dplyr - to count observations by group

    filter( ) - dplyr - to retain shapeName that has count not equal to 1.

    dupl_shapeName <- bdy_nga %>%
      add_count(bdy_nga$shapeName) %>%
      filter(n!=1) %>%
      select(-n)
    
    freq(dupl_shapeName$shapeName)
             n    % val%
    Bassa    2 16.7 16.7
    Ifelodun 2 16.7 16.7
    Irepodun 2 16.7 16.7
    Nasarawa 2 16.7 16.7
    Obi      2 16.7 16.7
    Surulere 2 16.7 16.7

    3.3.1.3 verify findings in section 3.3.1.2

    Usage of the code chunk below :

    tmap_mode( ) - tmap - to set tmap mode to static plotting or interactive.

    tm_shape( ) - tmap - to specify the shape object.

    tm_polygons( ) - tmap - to fill the polygons and draw the polygon borders.

    tm_view( ) - tmap - to set the options for the interactive tmap viewer.

    tm_fill( ) - tmap - to specify either which colour to be used or which data variable mapped to the colour palette.

    tm_borders( ) - tmap - to draw the polygon borders.

    tmap_style( ) - tmap - to set the tmap style.

    tm_layout( ) - tmap - to set the layout of cartographic map.

    tmap_mode("view")
    tmap mode set to interactive viewing
    tm_shape(bdy_nga)+
      tm_polygons()+
      tm_view(set.zoom.limits = c(6,8))+
    
    tm_shape(dupl_shapeName)+
      tm_fill("shapeName",
              n = 6,
              style = "jenks")+
      tm_borders(alpha = 0.5)+
      tmap_style("albatross")+
      tm_layout(main.title = "Distribution of Duplicated ShapeName",
                main.title.size = 1.3,
                main.title.position = "center")
    tmap style set to "albatross"
    other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "beaver", "bw", "classic", "watercolor" 

    Remarks :

    The plot above indicates those duplicated water points are not within the same location.

    tmap_mode("plot")
    tmap mode set to plotting

    3.3.1.4 acquire State info for duplicated value

    The State info to be combined with the duplicated “shapeName”. This will make all the shapeName unique.

    dupl_shapeName %>%
      mutate(st_centroid(dupl_shapeName$geometry, of_largest_polygon = FALSE))
    Simple feature collection with 12 features and 2 fields
    Active geometry column: geometry
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: 3.316459 ymin: 6.459038 xmax: 9.020704 ymax: 12.05035
    Geodetic CRS:  WGS 84
    First 10 features:
       shapeName bdy_nga$shapeName                       geometry
    1      Bassa             Bassa MULTIPOLYGON (((6.708541 7....
    2      Bassa             Bassa MULTIPOLYGON (((8.823522 10...
    3   Ifelodun          Ifelodun MULTIPOLYGON (((4.664107 8....
    4   Ifelodun          Ifelodun MULTIPOLYGON (((4.721977 7....
    5   Irepodun          Irepodun MULTIPOLYGON (((5.05493 8.0...
    6   Irepodun          Irepodun MULTIPOLYGON (((4.543349 7....
    7   Nasarawa          Nasarawa MULTIPOLYGON (((8.554589 11...
    8   Nasarawa          Nasarawa MULTIPOLYGON (((7.493228 8....
    9        Obi               Obi MULTIPOLYGON (((8.191919 6....
    10       Obi               Obi MULTIPOLYGON (((9.008576 8....
       st_centroid(dupl_shapeName$geometry, of_largest_polygon = FALSE)
    1                                         POINT (7.031827 7.791971)
    2                                         POINT (8.782946 10.08015)
    3                                         POINT (5.052235 8.544311)
    4                                         POINT (4.636735 7.920948)
    5                                         POINT (4.926215 8.169349)
    6                                          POINT (4.498797 7.84861)
    7                                         POINT (8.578262 12.00446)
    8                                         POINT (7.760272 8.304034)
    9                                         POINT (8.281026 7.022495)
    10                                         POINT (8.734777 8.35534)
    lga row_id headquarter state iso3166code dupl_shapeName_coord lga_office_coord
    Bassa 94 Oguma Kogi NG.KO.BA 7.791971, 7.031827 7.80932, 6.74853
    Bassa 95 Bassa Plateau NG.PL.BA 10.08015, 8.782946 10.11143, 8.71559
    Ifelodun 304 Share Kwara NG.KW.IF 8.544311, 5.052235 8.5 5.0
    Ifelodun 305 Ikirun Osun NG.OS.ID 7.920948, 4.636735  7.5 4.5
    Irepodun 355 Omu Aran Kwara NG.KW.IR 8.169349, 4.926215  8.5 5.0
    Irepodun 356 Ilobu Osun NG.OS.IP 7.84861, 4.498797  7.5 4.5
    Nasarawa 519 Bompai Kano NG.KN.NA 12.00446,
    8.578262 11.5 8.5
    Nasarawa 520 Nasarawa Nasarawa NG.NA.NA 8.304034, 7.760272 8.53477, 7.70153
    Obi 546 Obarike-Ito Benue NG.BE.OB 7.022495, 8.281026 7.01129, 8.33118
    Obi 547 Obi Nasarawa NG.NA.OB 8.35534, 8.734777 8.37944, 8.78561
    Surelere 693 Surelere Lagos NG.LA.SU 6.493217, 3.346919 6.50048, 3.35488
    Surelere 694 Iresa-Adu Oyo NG.OY.SU 8.088897, 4.393574 8.08459, 4.38538


    3.4 Data Wrangling

    3.4.1 Edit Duplicated Value :: “shapeName”

    Two (2) Main Steps involved :

    1. Combine “shapeName” with the State name to make each of them unique.
    2. Replace the “shapeName” value according to each row index.5
  • 5 Ong Z.R.J. (2022). Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate. https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#checking-of-duplicated-area-name

  • bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <- 
      c("Bassa Kogi",
        "Bassa Plateau",
        "Ifelodun Kwara",
        "Ifelodun Osun",
        "Irepodun Kwara",
        "Irepodun Osun",
        "Nasarawa Kano",
        "Nasarawa Nasarawa",
        "Obi Nasarawa",
        "Obi Benue",
        "Surulere Lagos",
        "Surulere Oyo")
    
    bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)]
     [1] "Bassa Kogi"        "Bassa Plateau"     "Ifelodun Kwara"   
     [4] "Ifelodun Osun"     "Irepodun Kwara"    "Irepodun Osun"    
     [7] "Nasarawa Kano"     "Nasarawa Nasarawa" "Obi Nasarawa"     
    [10] "Obi Benue"         "Surulere Lagos"    "Surulere Oyo"     

    3.4.1.1 validate edited value :: “shapeName”

    dupl_shapeName_val <- bdy_nga %>%
      add_count(bdy_nga$shapeName) %>%
      filter(n!=1) %>%
      select(-n)
    
    dupl_shapeName_val
    Simple feature collection with 0 features and 2 fields
    Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
    Geodetic CRS:  WGS 84
    [1] shapeName         bdy_nga$shapeName geometry         
    <0 rows> (or 0-length row.names)

    3.4.2 Perform Point-in-Polygon Overlay

    Combine both attribute and boundary of the water points into a simple feature object.

    3.4.2.1 join objects :: wp_sf and bdy_nga

    Usage of the code chunk below :

    st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.

    wp_joined <- st_join(x = wp_sf,
                         y = bdy_nga,
                         join = st_intersects,
                         left = TRUE)

    3.4.2.2 save and read RDS File :: wp_joined

    write_rds(wp_joined,"data/geodata/wp_joined.rds",compress = "xz")
    
    wp_joined <- read_rds("data/geodata/wp_joined.rds")

    3.4.2.3 inspect joined file :: wp_joined

    -- retrieve geometry summary

    st_geometry(wp_joined)
    Geometry set for 95008 features 
    Geometry type: POINT
    Dimension:     XY
    Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
    Geodetic CRS:  WGS 84
    First 5 geometries:
    POINT (10.47318 10.60104)
    POINT (6.95009 6.78599)
    POINT (7.615451 6.799595)
    POINT (7.30539 6.30817)
    POINT (10.44625 10.50681)

    -- assess uniqueness of Water Point

    wp_joined %>% janitor::get_dupes(shapeName,lat_lon_deg)
    No duplicate combinations found of: shapeName, lat_lon_deg
    Simple feature collection with 0 features and 24 fields
    Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
    Geodetic CRS:  WGS 84
    # A tibble: 0 × 25
    # … with 25 variables: shapeName <chr>, lat_lon_deg <chr>, dupe_count <int>,
    #   row_id <dbl>, lat_deg <dbl>, lon_deg <dbl>, New Georeferenced Column <chr>,
    #   water_source <chr>, water_source_clean <chr>, water_source_category <chr>,
    #   water_tech_clean <chr>, water_tech_category <chr>, status_clean <chr>,
    #   status <chr>, clean_adm1 <chr>, clean_adm2 <chr>,
    #   water_point_population <dbl>, local_population_1km <dbl>,
    #   crucialness_score <dbl>, pressure_score <dbl>, usage_capacity <dbl>, …

    Remarks :

    Each water point observation is unique as there are no duplicate combination of “shapeName” together with “lat_lon_deg”.

    -- determine reference point :: “shapeName” or “clean_adm2”

    wp_reference <- (wp_joined$shapeName == wp_joined$clean_adm2)
    freq(wp_reference)
              n    % val%
    FALSE 29856 31.4 31.4
    TRUE  65123 68.5 68.6
    NA       29  0.0   NA

    Remarks :

    • There are 29,713 “FALSE”, which is more than 30% of LGA names mismatched between “shapeName” and “clean_adm2”.

      • Since the geoBoundaries data is collected from government-published and reliable internet sources.6

        • Hence, the “shapeName” variable to be used throughout this study.
    • The 29 NA’s are 29 water points located beyond the LGA boundary, as shown in the plot below.

  • 6 Daniel et. al (2020) geoBoundaries: A global database of political administrative boundaries. PlosOne. https://doi.org/10.1371/journal.pone.0231866

  • tmap_mode("view")
    tmap mode set to interactive viewing
    tm_shape(bdy_nga) +
      tm_polygons() +
      tm_view(set.zoom.limits = c(5.5, 12))+
      
    tm_shape(filter(wp_joined, is.na(wp_reference))) +
      tm_dots(size=0.1,
              col="red")
    tmap_mode("plot")
    tmap mode set to plotting

    3.4.3 Remove Water Point Outside LGA Boundary

    wp_joined1 <- wp_joined %>% filter(shapeName == clean_adm2 | shapeName != clean_adm2)

    3.4.4 Replace “NA” with “Unknown”

    3.4.4.1 identify variable with missing value

    skim(wp_joined1)
    Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
    provided. Falling back to `character`.
    Data summary
    Name wp_joined1
    Number of rows 94979
    Number of columns 24
    _______________________
    Column type frequency:
    character 13
    logical 1
    numeric 10
    ________________________
    Group variables None

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    New Georeferenced Column 0 1.00 11 45 0 94979 0
    lat_lon_deg 0 1.00 8 42 0 94979 0
    water_source 0 1.00 3 32 0 16 0
    water_source_clean 302 1.00 8 22 0 5 0
    water_source_category 302 1.00 4 11 0 3 0
    water_tech_clean 10054 0.89 8 26 0 11 0
    water_tech_category 10054 0.89 8 15 0 4 0
    status_clean 10648 0.89 9 32 0 8 0
    status 10648 0.89 14 156 0 834 0
    clean_adm1 0 1.00 3 25 0 37 0
    clean_adm2 0 1.00 3 19 0 753 0
    geometry 0 1.00 7 37 0 94979 0
    shapeName 0 1.00 3 18 0 761 0

    Variable type: logical

    skim_variable n_missing complete_rate mean count
    is_urban 0 1 0.21 FAL: 75423, TRU: 19556

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    row_id 0 1.00 199938.21 189720.69 10732.00 52622.50 86939.00 323664.50 681838.00 ▇▃▂▂▁
    lat_deg 0 1.00 9.33 2.48 4.30 7.36 9.09 11.83 13.87 ▃▇▅▅▆
    lon_deg 0 1.00 7.50 2.25 2.71 5.52 7.89 9.08 14.22 ▃▃▇▃▁
    water_point_population 537 0.99 1246.24 4027.60 0.00 117.00 413.00 1169.00 384595.00 ▇▁▁▁▁
    local_population_1km 537 0.99 3723.38 7418.08 0.00 597.00 1756.00 4393.00 384595.00 ▇▁▁▁▁
    crucialness_score 6873 0.93 0.41 0.34 0.00 0.13 0.30 0.63 1.00 ▇▅▃▁▅
    pressure_score 6873 0.93 3.21 9.04 0.00 0.40 1.18 3.10 776.97 ▇▁▁▁▁
    usage_capacity 0 1.00 488.63 310.95 50.00 300.00 300.00 1000.00 1000.00 ▁▇▁▁▃
    staleness_score 0 1.00 44.94 6.29 23.13 41.49 42.87 44.34 99.00 ▁▇▁▁▁
    rehab_priority 53089 0.44 1545.55 5244.05 0.00 136.00 522.00 1527.00 384595.00 ▇▁▁▁▁

    3.4.4.2 replace “NA” with “Unknown”

    mutate( ) - dplyr - to run replace_na( ) function.

    • replace_na( ) - tidyr - to replace NAs with “unknown”.
    wp_joined1 <- wp_joined1 %>%
      mutate(status_clean = replace_na(status_clean, "Unknown")) %>%
      mutate(water_tech_category = replace_na(water_tech_category, "Unknown")) %>%
      mutate(status = replace_na(status, "Unknown")) %>%
      mutate(water_point_population = replace_na(water_point_population, 0)) %>%
      mutate(local_population_1km = replace_na(local_population_1km, 0)) %>%
      mutate(crucialness_score = replace_na(crucialness_score, 0)) %>%
      mutate(pressure_score = replace_na(pressure_score, 0))
    unique(wp_joined1$status_clean)
    [1] "Non-Functional"                   "Unknown"                         
    [3] "Functional"                       "Functional but needs repair"     
    [5] "Abandoned/Decommissioned"         "Functional but not in use"       
    [7] "Non-Functional due to dry season" "Abandoned"                       
    [9] "Non functional due to dry season"

    3.4.5 Standardise Value

    3.4.5.1 combine value :: “status_clean”

    wp_joined1$status_clean[wp_joined1$status_clean == "Non functional due to dry season"] <- "Non-Functional due to dry season"
    
    wp_joined1$status_clean[wp_joined1$status_clean == "Abandoned"] <- "Abandoned/Decommissioned"

    3.4.5.2 review “status_clean”

    unique(wp_joined1$status_clean)
    [1] "Non-Functional"                   "Unknown"                         
    [3] "Functional"                       "Functional but needs repair"     
    [5] "Abandoned/Decommissioned"         "Functional but not in use"       
    [7] "Non-Functional due to dry season"

    3.4.4.3 read RDS file :: wp_joined1

    wp_joined1 <- read_rds("data/geodata/wp_joined1.rds")

    3.4.5 Extract Water Point for New Table :: wp_nga

    3.4.5.1 extract functional water point

    wpt_functional <- wp_joined1 %>%
      filter(status_clean %in%
               c("Functional", 
                 "Functional but not in use",
                 "Functional but needs repair"))

    -- save and read RDS file :: wpt_functional

    write_rds(wpt_functional,"data/geodata/wpt_functional.rds",compress = "xz")
    
    wpt_functional <- read_rds("data/geodata/wpt_functional.rds")

    3.4.5.2 inspect variable and value

    -- reveal value :: “status_clean”

    freq(wpt_functional$status_clean)
                                    n    % val%
    Functional                  45883 88.0 88.0
    Functional but needs repair  4579  8.8  8.8
    Functional but not in use    1686  3.2  3.2
    length(wpt_functional$row_id)
    [1] 52148
    length(wpt_functional$row_id)/length(wp_joined1$row_id)*100
    [1] 54.88801

    Remarks :

    The total functional water points is 52,148 which is about 54.89% of total water points.

    -- reveal value :: “usage_capacity”

    freq(wpt_functional$usage_capacity)
             n    % val%
    50       2  0.0  0.0
    250     75  0.1  0.1
    300  38064 73.0 73.0
    1000 14007 26.9 26.9

    -- reveal value “usage_capacity” by “status_clean”

    wpt_functional %>% count(status_clean, usage_capacity, sort = TRUE)
    Simple feature collection with 10 features and 3 fields
    Geometry type: MULTIPOINT
    Dimension:     XY
    Bounding box:  xmin: 2.711632 ymin: 4.302938 xmax: 13.5022 ymax: 13.86331
    Geodetic CRS:  WGS 84
    # A tibble: 10 × 4
       status_clean                usage_capacity     n                     geometry
     * <chr>                                <dbl> <int>             <MULTIPOINT [°]>
     1 Functional                             300 33687 ((3.064921 7.994882), (3.06…
     2 Functional                            1000 12124 ((3.080189 7.99252), (3.085…
     3 Functional but needs repair            300  3306 ((3.340832 8.037962), (3.34…
     4 Functional but needs repair           1000  1271 ((3.373801 7.992051), (3.33…
     5 Functional but not in use              300  1071 ((3.046639 8.017765), (2.88…
     6 Functional but not in use             1000   612 ((3.088655 8.005296), (3.05…
     7 Functional                             250    70 ((3.355785 6.498105), (3.67…
     8 Functional but not in use              250     3 ((8.032945 6.878883), (7.00…
     9 Functional                              50     2 ((7.027967 4.765731), (8.92…
    10 Functional but needs repair            250     2 ((6.465915 5.826699), (7.93…

    -- reveal value :: “crucialness_score”

    summary(wpt_functional$crucialness_score == 1)
       Mode   FALSE    TRUE 
    logical   47006    5142 

    -- determine the total population within 1 km by “crucialness_score”

    freq(wpt_functional$crucialness_score == 1)
              n    % val%
    FALSE 47006 90.1 90.1
    TRUE   5142  9.9  9.9
    sum(wpt_functional[wpt_functional$crucialness_score == 1,]$local_population_1km)
    [1] 11252574

    Remarks :

    Given the “crucialness_score” is a ratio of current water point users to the total population within 1 km radius thereof :

    • Currently, 5,142 water points serve the population within a 1 km radius at its capacity limit.

      • The usage capacity may need to be increased to sustain or improve the growth or development rate within 1km of these water points.

      • Should the population within 1 km therefrom grow above 11,252,574, there may be multiple repercussions in resources management, urbanisation progress, local food and beverage consumption, local commodity prices, or worst case scenario would be the stability of civil society.

    summary(wpt_functional$pressure_score > 1)
       Mode   FALSE    TRUE 
    logical   27469   24679 
    length(wpt_functional$pressure_score)
    [1] 52148
    24679/52148*100 #percentage of functional waterpoints over their usage limit
    [1] 47.32492

    Remarks :

    Given the “pressure_score” is the ratio of the current water point users to the usage capacity thereof :

    • 24,679, or 47.32% of functional water points, are currently over their limit of usage capacity.

    3.4.5.3 Exploratory Data Analysis (EDA) :: wpt_functional

    -- plot “status_clean”

    ggplot(data = wpt_functional,
           aes(fct_infreq(status_clean), fill=status_clean))+ 
      geom_bar()+
      geom_text(
         aes(label=after_stat(count)),
         stat='count',
         nudge_x=-0.25,
         vjust=-0.2)+
      geom_text(
         aes(label= scales::percent(signif(after_stat(count/sum(count))))),
         stat='count',
         nudge_x=0.25,
         vjust=-0.2)+
      scale_x_discrete(guide = guide_axis(n.dodge = 2))+
      guides(fill=guide_legend (title = 'Status'))

    -- plot “water_tech_category”

    ggplot(data=wpt_functional, 
           aes(x=fct_infreq(
             water_tech_category)))+
      geom_bar(aes(
        fill = water_tech_category), 
        width = 0.8)+
      geom_text(aes(
        label = ..count..),
        stat = "count", 
        vjust=-0.2, 
        size = 3.5, 
        color = "black")+
      scale_x_discrete(guide = guide_axis(n.dodge = 2))+
      guides(fill=guide_legend (title = 'Water Tech Deployed'))
    Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
    ℹ Please use `after_stat(count)` instead.

    -- plot “water_source_clean”

    ggplot(data=wpt_functional, 
           aes(x=fct_infreq(
             water_source_clean)))+
      geom_bar(aes(
        fill = water_source_clean), 
        width = 0.8)+
      geom_text(aes(
        label = ..count..),
        stat = "count", 
        vjust=-0.2, 
        size = 3.5, 
        color = "black")+
      scale_x_discrete(guide = guide_axis(
        n.dodge = 2))+
      guides(fill=guide_legend (
        title = 'Source of Water Supply'))

    3.4.5.4 add attribute to new data table

    wp_nga <- bdy_nga %>%
      mutate(`total_wp` = lengths(
        st_intersects(bdy_nga, wp_joined1))) %>%
      
      mutate(`wp_functional` = lengths(
        st_intersects(bdy_nga, wpt_functional))) %>%
      
      mutate(`pct_functional` = (`wp_functional`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_functional` = replace_na(pct_functional, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional  
     Min.   :  0.00  
     1st Qu.: 32.61  
     Median : 47.41  
     Mean   : 49.84  
     3rd Qu.: 66.99  
     Max.   :100.00  

    3.4.5.5 extract non-functional water point

    wpt_nonFunctional <- wp_joined1 %>%
      filter(status_clean %in%
               c("Abandoned/Decommissioned", 
                 "Non-Functional",
                 "Non-Functional due to dry season"))

    -- save and read RDS file :: wpt_nonFuntional

    write_rds(wpt_nonFunctional,"data/geodata/wpt_nonFunctional.rds",compress = "xz")
    
    wpt_nonFunctional <- read_rds("data/geodata/wpt_nonFunctional.rds")

    3.4.5.6 inspect variable and value

    -- reveal value :: “status_clean”

    freq(wpt_nonFunctional$status_clean)
                                         n    % val%
    Abandoned/Decommissioned           234  0.7  0.7
    Non-Functional                   29385 91.7 91.7
    Non-Functional due to dry season  2410  7.5  7.5
    length(wpt_nonFunctional$row_id)
    [1] 32029
    length(wpt_nonFunctional$row_id)/length(wp_joined1$row_id)*100
    [1] 33.7119

    Remarks :

    There are 32,204, which is about 33.9% out of total water points.

    -- reveal value :: “usage_capacity”

    freq(wpt_nonFunctional$usage_capacity)
             n    % val%
    250     41  0.1  0.1
    300  20586 64.3 64.3
    1000 11402 35.6 35.6

    -- reveal value “usage_capacity” by “status_clean”

    wpt_nonFunctional %>% count(status_clean, usage_capacity, sort = TRUE)
    Simple feature collection with 7 features and 3 fields
    Geometry type: MULTIPOINT
    Dimension:     XY
    Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 13.4192 ymax: 13.86567
    Geodetic CRS:  WGS 84
    # A tibble: 7 × 4
      status_clean                     usage_capac…¹     n                  geometry
    * <chr>                                    <dbl> <int>          <MULTIPOINT [°]>
    1 Non-Functional                             300 18492 ((3.064526 7.994448), (3…
    2 Non-Functional                            1000 10852 ((3.083391 7.993231), (3…
    3 Non-Functional due to dry season           300  2012 ((3.051752 7.984243), (3…
    4 Non-Functional due to dry season          1000   398 ((3.056661 7.985808), (3…
    5 Abandoned/Decommissioned                  1000   152 ((4.713438 7.891137), (4…
    6 Abandoned/Decommissioned                   300    82 ((3.199483 8.912549), (2…
    7 Non-Functional                             250    41 ((3.976195 6.582998), (3…
    # … with abbreviated variable name ¹​usage_capacity

    -- reveal “crucialness_score”

    sum(wpt_nonFunctional$local_population_1km)
    [1] 93999535
    sum(wpt_nonFunctional$water_point_population)
    [1] 46255888

    Remarks :

    Given the “crucialness_score” is a ratio of current water point users to the total population within a 1 km radius thereof , in the context of non-functional :

    • Currently, out of 95,013,340 residents within a 1 km radius, there are 46,710,127 of them is affected by these non-functional water point.

    3.4.5.7 EDA :: wpt_nonFunctional

    -- plot “status_clean”

    ggplot(data = wpt_nonFunctional,
           aes(fct_infreq(status_clean), 
               fill=status_clean))+ 
      geom_bar()+
      geom_text(
         aes(label=after_stat(count)),
         stat='count',
         nudge_x=-0.25,
         vjust=-0.2)+
      geom_text(
         aes(label= scales::percent(
           signif(
             after_stat(
               count/sum(count)
               )))),
         stat='count',
         nudge_x=0.25,
         vjust=-0.2)+
      scale_x_discrete(
        guide = guide_axis(
          n.dodge = 2))+
      guides(fill=guide_legend (
        title = 'Status'))

    -- plot “water_tech_category”

    ggplot(data=wpt_nonFunctional, 
           aes(fct_infreq(
             water_tech_category)))+
      geom_bar(aes(
        fill = water_tech_category), 
        width = 0.8)+
      geom_text(aes(
        label = ..count..),
        stat = "count",
        vjust=-0.2,
        size = 3.5,
        color = "black")+
      scale_x_discrete(guide = guide_axis(
        n.dodge = 2))+
      guides(fill=guide_legend (
        title = 'Water Tech Deployed'))

    -- plot “water_source_clean”

    ggplot(data=wpt_nonFunctional, 
           aes(fct_infreq(
             water_source_clean)))+
      geom_bar(aes(
        fill = water_source_clean),
        width = 0.8)+
      geom_text(aes(
        label = ..count..),
        stat = "count",
        vjust=-0.2,
        size = 3.5,
        color = "black")+
      scale_x_discrete(guide = guide_axis(
        n.dodge = 2))+
      guides(fill=guide_legend (
        title = 'Source of Water Supply'))

    3.4.5.8 add wpt_nonFunctional to wp_nga

    wp_nga <- wp_nga %>%
      mutate(`wp_nonFunctional` = lengths(
        st_intersects(bdy_nga, wpt_nonFunctional))) %>%
      mutate(`pct_nonFunctional` = (`wp_nonFunctional`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_nonFunctional` = replace_na(pct_nonFunctional, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional   wp_nonFunctional pct_nonFunctional
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   
     1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77   
     Median : 47.41   Median : 33.50   Median : 34.89   
     Mean   : 49.84   Mean   : 41.37   Mean   : 35.58   
     3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00   
     Max.   :100.00   Max.   :278.00   Max.   :100.00   

    3.4.5.9 extract unknown water point

    wpt_unknown <- wp_joined1 %>%
      filter(status_clean == "Unknown")

    -- save and read RDS file :: wpt_unknown

    write_rds(wpt_unknown,"data/geodata/wpt_unknown.rds")
    
    wpt_unknown <- read_rds("data/geodata/wpt_unknown.rds")

    3.4.5.10 inspect variable and value

    -- reveal value :: “status_clean”

    length(wpt_unknown$row_id)
    [1] 10656
    length(wpt_unknown$row_id)/length(wp_joined1$row_id)*100
    [1] 11.2159

    Remarks :

    There are 10,656 water points with unknown status, about 11.22% of total water points.

    -- determine affected population

    sum(wpt_unknown$water_point_population)
    [1] 18831488
    sum(wpt_unknown$local_population_1km)
    [1] 31418651

    3.4.5.11 add wpt_unknown to wp_nga

    wp_nga <- wp_nga %>%
      mutate(`wp_unknown` = lengths(
        st_intersects(bdy_nga, wpt_unknown))) %>%
      mutate(`pct_unknown` = (`wp_unknown`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_unknown` = replace_na(pct_unknown, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
     1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
     Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
     Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
     3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
     Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
      pct_unknown    
     Min.   :  0.00  
     1st Qu.:  0.00  
     Median :  0.00  
     Mean   : 12.55  
     3rd Qu.: 20.83  
     Max.   :100.00  

    3.4.5.12 visualise distribution :: “status_clean”

    Usage of the code chunk below :

    qtm( ) - tmap - to plot a thematic map quickly.

    tmap_arrange( ) - tmap - to arrange small multiples in grid layout.

    total_wp <- qtm(wp_nga,"total_wp")+
      tm_layout(legend.height = 0.3, legend.width = 0.5)
    
    wp_functional <- qtm(wp_nga,"wp_functional")+
      tm_layout(legend.height = 0.3, legend.width = 0.5)
    
    wp_nonFunctional <- qtm(wp_nga,"wp_nonFunctional")+
      tm_layout(legend.height = 0.3, legend.width = 0.5)
    
    wp_unknown <- qtm(wp_nga,"wp_unknown")+
      tm_layout(legend.height = 0.3, legend.width = 0.5)
    
    tmap_arrange(total_wp, wp_functional, wp_nonFunctional, wp_unknown, asp=0, ncol = 2, nrow = 2, widths = 5, heights = 10, sync = TRUE)

    3.4.5.13 extract “water_tech_category” to wp_nga

    freq(wp_joined1$water_tech_category, sort = "dec")
                        n    % val%
    Hand Pump       58755 61.8 61.8
    Mechanized Pump 25644 27.0 27.0
    Unknown         10055 10.6 10.6
    Tapstand          553  0.6  0.6
    Rope and Bucket     1  0.0  0.0

    Remarks :

    Only “Hand Pump”, “Mechanized Pump”, and “Tapstand” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

    wtc_hPump <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Hand Pump")
    
    wtc_mPump <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Mechanized Pump")
    
    wtc_tStand <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Tapstand")
    
    wp_nga <- wp_nga %>%
      mutate(`total_handPump` = lengths(
        st_intersects(bdy_nga, wtc_hPump)
      )) %>%
      mutate(`total_mechPump` = lengths(
        st_intersects(bdy_nga, wtc_mPump)
      )) %>%
        mutate(`total_tapStand` = lengths(
        st_intersects(bdy_nga, wtc_tStand)
      )) %>%
      mutate(`pct_handPump` = (`total_handPump`/`total_wp`*100)) %>%
      mutate(`pct_mechPump` = (`total_mechPump`/`total_wp`*100)) %>%
      mutate(`pct_tapStand` = (`total_tapStand`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_handPump` = replace_na(pct_handPump, 0)) %>%
      mutate(`pct_mechPump` = replace_na(pct_mechPump, 0)) %>%
      mutate(`pct_tapStand` = replace_na(pct_tapStand, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
     1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
     Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
     Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
     3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
     Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
      pct_unknown     total_handPump   total_mechPump   total_tapStand   
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
     Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
     Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
     3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
      pct_handPump     pct_mechPump     pct_tapStand    
     Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000  
     Median : 50.99   Median : 31.27   Median : 0.0000  
     Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794  
     3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :100.00   Max.   :32.8947  

    3.4.5.14 visualise wp_nga distribution :: “water_tech_category”

    handPump <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_handPump",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    mechPump <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_mechPump",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    tapStand <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_tapStand",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    tmap_arrange(handPump, mechPump, tapStand,
                 asp=1, 
                 ncol=2,
                 sync = TRUE)

    3.4.5.15 extract “usage_capacity” to wp_nga

    freq(wp_joined1$usage_capacity, sort = "dec")
             n    % val%
    300  68789 72.4 72.4
    1000 25644 27.0 27.0
    250    573  0.6  0.6
    50       2  0.0  0.0

    Remarks :

    • Only “300”, “1000”, and “250” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

    • But, “50” will be included in the new variable “total_ucN1000” as part of the none ‘1000’ “usage_capacity” value.

    uCap_300 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "300")
    
    uCap_1000 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "1000")
    
    uCap_250 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "250")
    
    uCap_50 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "50")
    
    wp_nga <- wp_nga %>%
      mutate(`total_uc300` = lengths(
        st_intersects(bdy_nga, uCap_300)
      )) %>%
      mutate(`total_uc1000` = lengths(
        st_intersects(bdy_nga, uCap_1000)
      )) %>%
      mutate(`total_uc250` = lengths(
        st_intersects(bdy_nga, uCap_250)
      )) %>%
      mutate(`total_uc50` = lengths(
        st_intersects(bdy_nga, uCap_50)
      )) %>%
      mutate(`total_ucN1000` = ((lengths(
        st_intersects(
          bdy_nga, uCap_300))) + (lengths(
            st_intersects(
              bdy_nga, uCap_250))) + (lengths(
                st_intersects(
                  bdy_nga, uCap_50))))
        )%>%
               
      mutate(`pct_ucN1000` = (`total_ucN1000`/`total_wp`*100)) %>%
      mutate(`pct_uc300` = (`total_uc300`/`total_wp`*100)) %>%
      mutate(`pct_uc1000` = (`total_uc1000`/`total_wp`*100)) %>%
      mutate(`pct_uc250` = (`total_uc250`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_ucN1000` = replace_na(pct_ucN1000, 0)) %>%
      mutate(`pct_uc300` = replace_na(pct_uc300, 0)) %>%
      mutate(`pct_uc1000` = replace_na(pct_uc1000, 0)) %>%
      mutate(`pct_uc250` = replace_na(pct_uc250, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
     1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
     Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
     Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
     3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
     Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
      pct_unknown     total_handPump   total_mechPump   total_tapStand   
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
     Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
     Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
     3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
      pct_handPump     pct_mechPump     pct_tapStand      total_uc300    
     Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000   Min.   :  0.00  
     1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000   1st Qu.: 15.25  
     Median : 50.99   Median : 31.27   Median : 0.0000   Median : 59.00  
     Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794   Mean   : 88.85  
     3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000   3rd Qu.:126.75  
     Max.   :100.00   Max.   :100.00   Max.   :32.8947   Max.   :767.00  
      total_uc1000     total_uc250        total_uc50       total_ucN1000   
     Min.   :  0.00   Min.   : 0.0000   Min.   :0.000000   Min.   :  0.00  
     1st Qu.: 11.00   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.: 16.00  
     Median : 25.50   Median : 0.0000   Median :0.000000   Median : 60.00  
     Mean   : 33.12   Mean   : 0.7403   Mean   :0.002584   Mean   : 89.59  
     3rd Qu.: 46.00   3rd Qu.: 0.0000   3rd Qu.:0.000000   3rd Qu.:127.75  
     Max.   :245.00   Max.   :42.0000   Max.   :1.000000   Max.   :767.00  
      pct_ucN1000       pct_uc300        pct_uc1000       pct_uc250      
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.: 39.68   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
     Median : 67.03   Median : 65.91   Median : 31.27   Median : 0.0000  
     Mean   : 60.78   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
     3rd Qu.: 87.35   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :32.8947  

    3.4.5.16 visualise wp_nga distribution :: “usage_capacity”

    uc300 <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_uc300",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    uc1000 <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_uc1000",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    uc250 <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_uc250",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    tmap_arrange(uc300, uc1000, uc250,
                 asp=1, 
                 ncol=2,
                 sync = TRUE)

    3.4.5.17 extract “is_urban” to wp_nga

    urban_1 <- wp_joined1 %>%
      filter(is_urban %in%
               "TRUE")
    
    urban_0 <- wp_joined1 %>%
      filter(is_urban %in%
               "FALSE")
    
    wp_nga <- wp_nga %>%
      mutate(`total_urban1` = lengths(
        st_intersects(bdy_nga, urban_1)
      )) %>%
      mutate(`total_urban0` = lengths(
        st_intersects(bdy_nga, urban_0)
      )) %>%
      mutate(`pct_urban1` = (`total_urban1`/`total_wp`*100)) %>%
      mutate(`pct_urban0` = (`total_urban0`/`total_wp`*100))

    -- replace “NaN” with 0

    wp_nga <- wp_nga %>%
      mutate(`pct_urban1` = replace_na(pct_urban1, 0)) %>%
      mutate(`pct_urban0` = replace_na(pct_urban0, 0))
    
    summary(wp_nga)
      shapeName                  geometry      total_wp     wp_functional   
     Length:774         MULTIPOLYGON :774   Min.   :  0.0   Min.   :  0.00  
     Class :character   epsg:4326    :  0   1st Qu.: 45.0   1st Qu.: 17.00  
     Mode  :character   +proj=long...:  0   Median : 96.0   Median : 45.50  
                                            Mean   :122.7   Mean   : 67.36  
                                            3rd Qu.:168.8   3rd Qu.: 87.75  
                                            Max.   :894.0   Max.   :752.00  
     pct_functional   wp_nonFunctional pct_nonFunctional   wp_unknown    
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00    Min.   :  0.00  
     1st Qu.: 32.61   1st Qu.: 12.00   1st Qu.: 20.77    1st Qu.:  0.00  
     Median : 47.41   Median : 33.50   Median : 34.89    Median :  0.00  
     Mean   : 49.84   Mean   : 41.37   Mean   : 35.58    Mean   : 13.76  
     3rd Qu.: 66.99   3rd Qu.: 60.00   3rd Qu.: 50.00    3rd Qu.: 17.75  
     Max.   :100.00   Max.   :278.00   Max.   :100.00    Max.   :219.00  
      pct_unknown     total_handPump   total_mechPump   total_tapStand   
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.:  0.00   1st Qu.:  6.00   1st Qu.: 11.00   1st Qu.: 0.0000  
     Median :  0.00   Median : 47.00   Median : 25.50   Median : 0.0000  
     Mean   : 12.55   Mean   : 75.89   Mean   : 33.12   Mean   : 0.7145  
     3rd Qu.: 20.83   3rd Qu.:111.00   3rd Qu.: 46.00   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :764.00   Max.   :245.00   Max.   :42.0000  
      pct_handPump     pct_mechPump     pct_tapStand      total_uc300    
     Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000   Min.   :  0.00  
     1st Qu.: 16.70   1st Qu.: 12.20   1st Qu.: 0.0000   1st Qu.: 15.25  
     Median : 50.99   Median : 31.27   Median : 0.0000   Median : 59.00  
     Mean   : 48.73   Mean   : 37.54   Mean   : 0.5794   Mean   : 88.85  
     3rd Qu.: 77.78   3rd Qu.: 57.71   3rd Qu.: 0.0000   3rd Qu.:126.75  
     Max.   :100.00   Max.   :100.00   Max.   :32.8947   Max.   :767.00  
      total_uc1000     total_uc250        total_uc50       total_ucN1000   
     Min.   :  0.00   Min.   : 0.0000   Min.   :0.000000   Min.   :  0.00  
     1st Qu.: 11.00   1st Qu.: 0.0000   1st Qu.:0.000000   1st Qu.: 16.00  
     Median : 25.50   Median : 0.0000   Median :0.000000   Median : 60.00  
     Mean   : 33.12   Mean   : 0.7403   Mean   :0.002584   Mean   : 89.59  
     3rd Qu.: 46.00   3rd Qu.: 0.0000   3rd Qu.:0.000000   3rd Qu.:127.75  
     Max.   :245.00   Max.   :42.0000   Max.   :1.000000   Max.   :767.00  
      pct_ucN1000       pct_uc300        pct_uc1000       pct_uc250      
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.: 39.68   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
     Median : 67.03   Median : 65.91   Median : 31.27   Median : 0.0000  
     Mean   : 60.78   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
     3rd Qu.: 87.35   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
     Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :32.8947  
      total_urban1     total_urban0      pct_urban1       pct_urban0    
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
     1st Qu.:  0.00   1st Qu.: 23.00   1st Qu.:  0.00   1st Qu.: 57.27  
     Median :  9.00   Median : 64.00   Median : 11.95   Median : 86.45  
     Mean   : 25.27   Mean   : 97.45   Mean   : 25.61   Mean   : 72.71  
     3rd Qu.: 33.00   3rd Qu.:141.00   3rd Qu.: 38.44   3rd Qu.:100.00  
     Max.   :324.00   Max.   :894.00   Max.   :100.00   Max.   :100.00  

    3.4.5.18 visualise wp_nga distribution :: “is_urban”

    urban1 <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_urban1",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    urban0 <- tm_shape(bdy_nga)+
      tm_polygons(alpha = 0.1) +
    tm_shape(wp_nga) +  
      tm_dots(col = "pct_urban0",
              border.col = "gray60",
              border.lwd = 0.5) +
      tm_view(set.zoom.limits = c(5,9))
    
    tmap_arrange(urban1, urban0,
                 asp=1, 
                 ncol=2,
                 sync = TRUE)


    3.4.5.19 save and read RDS File :: wp_nga

    write_rds(wp_nga,"data/geodata/wp_nga.rds")
    wp_nga <- read_rds("data/geodata/wp_nga.rds")

    3.4.5.20 transform to Projected Coordinate System

    Usage of the code chunk below :

    st_crs( ) - sf - to inspect the coordinate reference system.

    st_crs(wp_nga)
    Coordinate Reference System:
      User input: EPSG:4326 
      wkt:
    GEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        CS[ellipsoidal,2],
            AXIS["geodetic latitude (Lat)",north,
                ORDER[1],
                ANGLEUNIT["degree",0.0174532925199433]],
            AXIS["geodetic longitude (Lon)",east,
                ORDER[2],
                ANGLEUNIT["degree",0.0174532925199433]],
        USAGE[
            SCOPE["Horizontal component of 3D system."],
            AREA["World."],
            BBOX[-90,-180,90,180]],
        ID["EPSG",4326]]

    Remarks :

    The EPSG for wp_nga is 4326, which is WGS 84. To compute the proximity distance matrix for clustering analysis, this coordinate reference system needs to transform into EPSG: 26391.

    Usage of the code chunk below :

    st_set_crs( ) - sf - to update the coordinate reference system.

    wp_ngaTrans <- st_set_crs(wp_nga, 26391)
    Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
    that
    bdy_ngaTrans <- st_set_crs(bdy_nga, 26391)
    Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
    that

    -- review CRS :: wp_ngaTrans

    st_crs(wp_ngaTrans)
    Coordinate Reference System:
      User input: EPSG:26391 
      wkt:
    PROJCRS["Minna / Nigeria West Belt",
        BASEGEOGCRS["Minna",
            DATUM["Minna",
                ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                    LENGTHUNIT["metre",1]]],
            PRIMEM["Greenwich",0,
                ANGLEUNIT["degree",0.0174532925199433]],
            ID["EPSG",4263]],
        CONVERSION["Nigeria West Belt",
            METHOD["Transverse Mercator",
                ID["EPSG",9807]],
            PARAMETER["Latitude of natural origin",4,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8801]],
            PARAMETER["Longitude of natural origin",4.5,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8802]],
            PARAMETER["Scale factor at natural origin",0.99975,
                SCALEUNIT["unity",1],
                ID["EPSG",8805]],
            PARAMETER["False easting",230738.26,
                LENGTHUNIT["metre",1],
                ID["EPSG",8806]],
            PARAMETER["False northing",0,
                LENGTHUNIT["metre",1],
                ID["EPSG",8807]]],
        CS[Cartesian,2],
            AXIS["(E)",east,
                ORDER[1],
                LENGTHUNIT["metre",1]],
            AXIS["(N)",north,
                ORDER[2],
                LENGTHUNIT["metre",1]],
        USAGE[
            SCOPE["Engineering survey, topographic mapping."],
            AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
            BBOX[3.57,2.69,13.9,6.5]],
        ID["EPSG",26391]]

    -- review CRS :: bdy_ngaTrans

    st_crs(bdy_ngaTrans)
    Coordinate Reference System:
      User input: EPSG:26391 
      wkt:
    PROJCRS["Minna / Nigeria West Belt",
        BASEGEOGCRS["Minna",
            DATUM["Minna",
                ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                    LENGTHUNIT["metre",1]]],
            PRIMEM["Greenwich",0,
                ANGLEUNIT["degree",0.0174532925199433]],
            ID["EPSG",4263]],
        CONVERSION["Nigeria West Belt",
            METHOD["Transverse Mercator",
                ID["EPSG",9807]],
            PARAMETER["Latitude of natural origin",4,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8801]],
            PARAMETER["Longitude of natural origin",4.5,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8802]],
            PARAMETER["Scale factor at natural origin",0.99975,
                SCALEUNIT["unity",1],
                ID["EPSG",8805]],
            PARAMETER["False easting",230738.26,
                LENGTHUNIT["metre",1],
                ID["EPSG",8806]],
            PARAMETER["False northing",0,
                LENGTHUNIT["metre",1],
                ID["EPSG",8807]]],
        CS[Cartesian,2],
            AXIS["(E)",east,
                ORDER[1],
                LENGTHUNIT["metre",1]],
            AXIS["(N)",north,
                ORDER[2],
                LENGTHUNIT["metre",1]],
        USAGE[
            SCOPE["Engineering survey, topographic mapping."],
            AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
            BBOX[3.57,2.69,13.9,6.5]],
        ID["EPSG",26391]]

    3.5 Exploratory Data Analysis

    3.5.1 Identify Outliers

    3.5.1.1 plot boxplot “pct_functional”

    ggplot(data=wp_ngaTrans, 
           aes(x=`pct_functional`)) +
      geom_boxplot(color="black", 
                   fill="#543005")

    3.5.1.2 plot boxplot “pct_nonFunctional”

    ggplot(data=wp_ngaTrans, 
           aes(x=`pct_nonFunctional`)) +
      geom_boxplot(color="black", 
                   fill="#C16622FF")

    3.5.1.3 plot boxplot “pct_unknown”

    ggplot(data=wp_ngaTrans, 
           aes(x=`pct_unknown`)) +
      geom_boxplot(color="black", 
                   fill="#FFA319FF")

    Remarks :

    Among these 3 key categories of “status_clean”, “unknown” has the most outliers.

    3.5.2 Multi-plot Histogram

    3.5.2.1 plot histogram for “status_clean”

    pctFunctional <- ggplot(data = wp_ngaTrans,
                             aes(x = `pct_functional`))+
      geom_histogram(bins=10,
                     colour = "black",
                     fill = "#543005")
    
    pctNonFunctional <- ggplot(data = wp_ngaTrans,
                             aes(x = `pct_nonFunctional`))+
      geom_histogram(bins=10,
                     colour = "black",
                     fill = "#C16622FF")
    
    pctUnknown <- ggplot(data = wp_ngaTrans,
                         aes(x = `pct_unknown`))+
      geom_histogram(bins = 10,
                     colour = "black",
                     fill = "#FFA319FF")
    ggarrange(pctFunctional,pctNonFunctional,pctUnknown,
              ncol = 2,
              nrow = 2)
    adding dummy grobs

    4. CORRELATION ANALYSIS

    4.1 Create Data Table for Correlation Matrix Analysis

    cluster_vars <- wp_ngaTrans %>%
      st_set_geometry(NULL) %>%
      select("shapeName",
             "pct_functional", 
             "pct_nonFunctional",
             "pct_unknown", 
             "pct_handPump",
             "pct_mechPump",
             "pct_tapStand",
             "pct_uc300",
             "pct_uc1000",
             "pct_ucN1000",
             "pct_uc250",
             "pct_urban0")
    head(cluster_vars,5)
      shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
    1 Aba North       41.17647          52.94118    5.882353    11.764706
    2 Aba South       40.84507          46.47887    9.859155     9.859155
    3    Abadam        0.00000           0.00000    0.000000     0.000000
    4     Abaji       40.35088          59.64912    0.000000    40.350877
    5      Abak       47.91667          50.00000    0.000000     8.333333
      pct_mechPump pct_tapStand pct_uc300 pct_uc1000 pct_ucN1000 pct_uc250
    1     82.35294            0 17.647059   82.35294   17.647059         0
    2     87.32394            0 12.676056   87.32394   12.676056         0
    3      0.00000            0  0.000000    0.00000    0.000000         0
    4     59.64912            0 40.350877   59.64912   40.350877         0
    5     91.66667            0  8.333333   91.66667    8.333333         0
      pct_urban0
    1   0.000000
    2   5.633803
    3   0.000000
    4  84.210526
    5  83.333333

    4.2 Visualise Correlation Matrix

    Usage of the code chunk below :

    corrplot.mixed( ) - corrplot - to use mixed methods to visualise a correlation matrix.

    This plot allows to identify the pattern and the relationship in the matrix.

    corrplot.mixed((cor(cluster_vars[,2:12])),
                   upper = "number",
                   lower = "ellipse",
                   tl.col = "black",
                   diag = "l",
                   tl.pos = "lt")

    Remarks :

    Following are the pairs with strong correlation :

    correlation coefficients variable_1 variable_2
    1.00 pct_mechPump pct_uc1000
    0.99 pct_tapStand pct_uc250
    0.99 pct_uc300 pct_ucN1000
    -0.91 pct_mechPump pct_ucN1000
    -0.91 pct_uc1000 pct_ucN1000
    -0.90 pct_mechPump pct_uc300
    -0.90 pct_uc300 pct_uc1000

    4.2.1 Replace Row ID with “shapeName”

    row.names(cluster_vars) <- cluster_vars$shapeName

    4.2.2 Trim High Correlation Variable and “shapeName”

    cluster_varsTrim <- cluster_vars %>%
      select(-shapeName, -pct_ucN1000, -pct_mechPump)

    4.2.2.1 review trimmed data table

    summary(cluster_varsTrim)
     pct_functional   pct_nonFunctional  pct_unknown      pct_handPump   
     Min.   :  0.00   Min.   :  0.00    Min.   :  0.00   Min.   :  0.00  
     1st Qu.: 32.61   1st Qu.: 20.77    1st Qu.:  0.00   1st Qu.: 16.70  
     Median : 47.41   Median : 34.89    Median :  0.00   Median : 50.99  
     Mean   : 49.84   Mean   : 35.58    Mean   : 12.55   Mean   : 48.73  
     3rd Qu.: 66.99   3rd Qu.: 50.00    3rd Qu.: 20.83   3rd Qu.: 77.78  
     Max.   :100.00   Max.   :100.00    Max.   :100.00   Max.   :100.00  
      pct_tapStand       pct_uc300        pct_uc1000       pct_uc250      
     Min.   : 0.0000   Min.   :  0.00   Min.   :  0.00   Min.   : 0.0000  
     1st Qu.: 0.0000   1st Qu.: 38.67   1st Qu.: 12.20   1st Qu.: 0.0000  
     Median : 0.0000   Median : 65.91   Median : 31.27   Median : 0.0000  
     Mean   : 0.5794   Mean   : 60.17   Mean   : 37.54   Mean   : 0.6114  
     3rd Qu.: 0.0000   3rd Qu.: 87.02   3rd Qu.: 57.71   3rd Qu.: 0.0000  
     Max.   :32.8947   Max.   :100.00   Max.   :100.00   Max.   :32.8947  
       pct_urban0    
     Min.   :  0.00  
     1st Qu.: 57.27  
     Median : 86.45  
     Mean   : 72.71  
     3rd Qu.:100.00  
     Max.   :100.00  

    5. CLUSTERING ANALYSIS

    5.1 Hierarchy Clustering

    There are four (4) main steps :

    • compute proximity matrix.
    • assign data point to a cluster.
    • merge clusters based on similarity between clusters.
    • update the distance matrix.

    5.1.1 Standardise Data

    As shown in the 4.2.3.1, there are few variables with Max. different from others. Hence, standardisation will be required prior to further analysis.

    5.1.1.1 standardise with min-max method

    nga_wpStd <- normalize(cluster_varsTrim)
    summary(nga_wpStd)
     pct_functional   pct_nonFunctional  pct_unknown      pct_handPump   
     Min.   :0.0000   Min.   :0.0000    Min.   :0.0000   Min.   :0.0000  
     1st Qu.:0.3261   1st Qu.:0.2077    1st Qu.:0.0000   1st Qu.:0.1670  
     Median :0.4741   Median :0.3489    Median :0.0000   Median :0.5099  
     Mean   :0.4984   Mean   :0.3558    Mean   :0.1255   Mean   :0.4873  
     3rd Qu.:0.6699   3rd Qu.:0.5000    3rd Qu.:0.2083   3rd Qu.:0.7778  
     Max.   :1.0000   Max.   :1.0000    Max.   :1.0000   Max.   :1.0000  
      pct_tapStand       pct_uc300        pct_uc1000       pct_uc250      
     Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
     1st Qu.:0.00000   1st Qu.:0.3867   1st Qu.:0.1220   1st Qu.:0.00000  
     Median :0.00000   Median :0.6591   Median :0.3127   Median :0.00000  
     Mean   :0.01761   Mean   :0.6017   Mean   :0.3754   Mean   :0.01859  
     3rd Qu.:0.00000   3rd Qu.:0.8702   3rd Qu.:0.5771   3rd Qu.:0.00000  
     Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
       pct_urban0    
     Min.   :0.0000  
     1st Qu.:0.5727  
     Median :0.8645  
     Mean   :0.7271  
     3rd Qu.:1.0000  
     Max.   :1.0000  

    5.1.1.2 standardise with Z-score method

    nga_wpZ <- scale(cluster_varsTrim)
    describe(nga_wpZ)
                      vars   n mean sd median trimmed  mad   min   max range  skew
    pct_functional       1 774    0  1  -0.10   -0.02 1.04 -2.06  2.07  4.13  0.14
    pct_nonFunctional    2 774    0  1  -0.03   -0.02 1.05 -1.71  3.10  4.81  0.23
    pct_unknown          3 774    0  1  -0.62   -0.22 0.00 -0.62  4.30  4.92  2.01
    pct_handPump         4 774    0  1   0.07    0.00 1.37 -1.49  1.57  3.06 -0.09
    pct_tapStand         5 774    0  1  -0.19   -0.19 0.00 -0.19 10.46 10.65  7.22
    pct_uc300            6 774    0  1   0.19    0.08 1.10 -2.02  1.34  3.35 -0.56
    pct_uc1000           7 774    0  1  -0.21   -0.09 1.05 -1.28  2.14  3.42  0.61
    pct_uc250            8 774    0  1  -0.20   -0.20 0.00 -0.20 10.37 10.57  7.10
    pct_urban0           9 774    0  1   0.42    0.17 0.62 -2.23  0.84  3.06 -1.12
                      kurtosis   se
    pct_functional       -0.62 0.04
    pct_nonFunctional    -0.42 0.04
    pct_unknown           4.15 0.04
    pct_handPump         -1.33 0.04
    pct_tapStand         58.65 0.04
    pct_uc300            -0.87 0.04
    pct_uc1000           -0.78 0.04
    pct_uc250            57.14 0.04
    pct_urban0           -0.09 0.04

    5.1.1.3 visualise distribution of standardised clustering variable

    -- functional water point

    fwp <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_functional`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    fwp_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_functional`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    fwp_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_functional`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(fwp, fwp_std, fwp_z,
              ncol = 3,
              nrow = 1)

    fwp <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_functional`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    fwp_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_functional`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    fwp_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_functional`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(fwp, fwp_std, fwp_z,
              ncol = 3,
              nrow = 1)

    -- water point deployed with handpump

    HP <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_handPump`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    HP_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_handPump`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    HP_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_handPump`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(HP, HP_std, HP_z,
              ncol = 3,
              nrow = 1)

    HP <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_handPump`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    HP_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_handPump`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    HP_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_handPump`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(HP, HP_std, HP_z,
              ncol = 3,
              nrow = 1)

    -- water point with 1000 users usage capacity

    uc1000 <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_uc1000`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    uc1000_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_uc1000`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    uc1000_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_uc1000`)) +
      geom_histogram(bins=20, 
                     color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(uc1000, uc1000_std, uc1000_z,
              ncol = 3,
              nrow = 1)

    uc1000 <- ggplot(data=cluster_varsTrim, 
                 aes(x= `pct_uc1000`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Before Standardisation")
    
    
    fwp_stdDf <- as.data.frame(nga_wpStd)
    uc1000_std <- ggplot(data=fwp_stdDf, 
           aes(x=`pct_uc1000`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Min-Max Stdsn.")
    
    fwp_zDf <- as.data.frame(nga_wpZ)
    uc1000_z <- ggplot(data=fwp_zDf, 
           aes(x=`pct_uc1000`)) +
      geom_density(color="black", 
                     fill="steelblue") +
      ggtitle("Z-score Stdsn.")
    
    ggarrange(uc1000, uc1000_std, uc1000_z,
              ncol = 3,
              nrow = 1)

    5.1.2 Compute Proximity Matrix

    Usage of the code chunk below :

    dist( ) - stats - to compute the proximity distance matrix. Among euclidean, maximum, manhattan, canberra, binary and minkowski, euclidean is used to compute proxmat_euc.

    proxmat_euc <- dist(cluster_varsTrim, method = 'euclidean')

    5.1.3 Compute Hierarchical Clustering

    Usage of the code chunk below :

    hclust( ) - stats - to compute cluster with agglomeration method.

    ggdendrogram( ) - ggdendro - to plot dendrogram with tools available in ggplot2.

    hieClust_warD <- hclust(proxmat_euc, method = 'ward.D')
    ggdendrogram(hieClust_warD, 
                 rotate = TRUE, 
                 size = 2, 
                 theme_dendro = FALSE)

    5.1.4 Determine Optimal Clustering Algorithm

    Usage of the code chunk below :

    agnes( ) - cluster - to get agglomerative coefficient of 4 clustering structure, namely “average”, “single”, “complete” and “ward”.

    m <- c( "average", "single", "complete", "ward")
    
    names(m) <- c( "average", "single", "complete", "ward")
    
    ac <- function(x) {
      agnes(cluster_varsTrim, method = x)$ac
      }
    
    map_dbl(m, ac)
      average    single  complete      ward 
    0.9264460 0.8825086 0.9494033 0.9923235 

    Remarks :

    • Value 1 indicate strongest clustering structure.

    • Ward’s method provides the strongest clustering structure. Therefore, Ward’s method to be used in subsequent analysis.

    5.1.5 Determine Optimal Clusters

    To determine the optimal clusters to retain, following commons methods are tested :

    • Gap statistic

    • Elbow

    • Average Silhouette

    5.1.5.1 compute Gap Statistic method

    Usage of the code chunk below :

    clusGap( ) - cluster - to compute the gap statistic.

    set.seed(12345)
    gap_stat <- clusGap(cluster_varsTrim, 
                        FUN = hcut, 
                        nstart = 25, 
                        K.max = 30, 
                        B = 50)
    # Print the result
    print(gap_stat, method = "firstmax")
    Clustering Gap statistic ["clusGap"] from call:
    clusGap(x = cluster_varsTrim, FUNcluster = hcut, K.max = 30, B = 50, nstart = 25)
    B=50 simulated reference sets, k = 1..30; spaceH0="scaledPCA"
     --> Number of clusters (method 'firstmax'): 30
              logW    E.logW       gap      SE.sim
     [1,] 9.815280 10.315513 0.5002330 0.008043258
     [2,] 9.602465 10.203144 0.6006791 0.008510149
     [3,] 9.510126 10.144007 0.6338806 0.010041659
     [4,] 9.443727 10.094054 0.6503272 0.009408325
     [5,] 9.337773 10.055036 0.7172630 0.008377048
     [6,] 9.286151 10.021049 0.7348986 0.008061617
     [7,] 9.226030  9.992116 0.7660864 0.007509762
     [8,] 9.181055  9.967079 0.7860239 0.007406854
     [9,] 9.132662  9.944999 0.8123364 0.007505962
    [10,] 9.088576  9.924941 0.8363644 0.008003248
    [11,] 9.057435  9.906340 0.8489044 0.008086779
    [12,] 9.019733  9.888898 0.8691655 0.008378669
    [13,] 8.988798  9.872948 0.8841499 0.008499708
    [14,] 8.962331  9.857905 0.8955736 0.008609360
    [15,] 8.932159  9.843557 0.9113980 0.008574353
    [16,] 8.908477  9.830133 0.9216564 0.008442998
    [17,] 8.880801  9.817233 0.9364313 0.008235439
    [18,] 8.846775  9.805149 0.9583744 0.008068772
    [19,] 8.828254  9.793671 0.9654167 0.007975742
    [20,] 8.812263  9.782502 0.9702393 0.007904523
    [21,] 8.793736  9.771938 0.9782012 0.007903082
    [22,] 8.777957  9.761659 0.9837022 0.007927244
    [23,] 8.762944  9.751686 0.9887413 0.007838986
    [24,] 8.745719  9.741903 0.9961837 0.007850884
    [25,] 8.732706  9.732508 0.9998020 0.007792150
    [26,] 8.716858  9.723358 1.0064996 0.007813310
    [27,] 8.703095  9.714414 1.0113191 0.007684052
    [28,] 8.684688  9.705770 1.0210819 0.007586041
    [29,] 8.664250  9.697408 1.0331579 0.007592467
    [30,] 8.649964  9.689233 1.0392698 0.007607051

    -- visualise gap_stat

    Usage of the code chunk below :

    fviz_nbclust( ) - factoextra - to compute and visualise the Optimal Number of clusters.

    set.seed(12345)
    fviz_nbclust(nga_wpZ, 
                 kmeans, 
                 nstart = 25,  
                 method = "gap_stat", 
                 nboot = 50)+
      labs(subtitle = "Gap statistic method")
    Warning: did not converge in 10 iterations
    
    Warning: did not converge in 10 iterations
    
    Warning: did not converge in 10 iterations

    5.1.5.2 compute and visualise Elbow method

    fviz_nbclust(nga_wpZ, kmeans, method = "wss") +
        geom_vline(xintercept = 4, linetype = 2)+
      labs(subtitle = "Elbow method")

    5.1.5.3 compute and visualise Silhouette method

    fviz_nbclust(nga_wpZ, kmeans, method = "silhouette")+
      labs(subtitle = "Silhouette method")

    Remarks :

    Given the Elbow method, Silhouette method and Gap Statistic method, the 5-cluster by Silhouette method will be used for the rest of the study.

    5.1.5.4 interpret with Dendrogram

    Usage of the code chunk below :

    rect.hclust( ) - stats - to draw the dendrogram with a border around the selected clusters.

    plot(hieClust_warD, cex = 0.6)
    rect.hclust(hieClust_warD, 
                k = 5, 
                border = 2:5)

    5.1.6 Visually-Driven Hierarchical Clustering Analysis

    The data is loaded into a data frame, but it has to be a data matrix to plot the heatmap. Hence, the data frame will need to first transform into a matrix.

    5.1.6.1 transform data frame into matrix

    Usage of the code chunk below :

    data.matrix( ) - base - to transform cluster_varsTrim data frame into a data matrix, and named it as nga_clustMat.

    nga_clustMat <- data.matrix(cluster_varsTrim)

    5.1.6.2 plot interactive cluster heatmap

    Usage of the code chunk below :

    heatmaply( ) - heatmaply - to build an interactive cluster heatmap.

    heatmaply(normalize(nga_clustMat),
              Colv=NA,
              dist_method = "euclidean",
              hclust_method = "ward.D",
              seriate = "OLO",
              colors = Blues,
              k_row = 5,
              margins = c(NA,200,60,NA),
              fontsize_row = 4,
              fontsize_col = 5,
              main="Geographic Segmentation of Nigeria by Water Points",
              xlab = "Water Points",
              ylab = "Nigeria LGA"
              )

    Remarks :

    Based on the plot above, 5 clusters to be retained for further analysis.

    5.1.6.3 map the formed cluster

    Usage of the code chunk below :

    cutree( ) - base - to derive a 5-cluster model, and named the output as groups.

    groups <- as.factor(cutree(hieClust_warD, k=5))

    5.1.6.4 append groups to wp_ngaTrans

    nga_clust.sf <- cbind(wp_ngaTrans, as.matrix(groups)) %>%
      rename(`cluster`=`as.matrix.groups.`)

    5.1.6.5 plot choropleth map :: nga_clust.sf

    qtm(nga_clust.sf, "cluster")

    Remarks :

    The choropleth map above shows the fragmented clusters by the used of non-spatial clustering algorithm (hierarchical cluster analysis method).

    5.2 Spatially Constrained Clustering :: SKATER Approach

    SKATER function only support sp objects in SpatialPolygonDataFrame. Hence, the wp_ngaTrans has to first transform into SpatialPolygonDataFrame before proceed further.

    5.2.1 Convert SF to SP Data Frame

    Usage of the code chunk below :

    as_Spatial( ) - sf - to convert wp_ngaTrans into nga_sp in a SP data frame.

    nga.sp <- as_Spatial(wp_ngaTrans)

    5.2.2 Compute Neighbour List

    Usage of the code chunk below :

    poly2nb( ) - spdep - to compute the neighbours list from polygon list.

    nga.nb <- poly2nb(nga.sp, queen = TRUE)
    summary(nga.nb)
    Neighbour list object:
    Number of regions: 774 
    Number of nonzero links: 4440 
    Percentage nonzero weights: 0.7411414 
    Average number of links: 5.736434 
    1 region with no links:
    86
    Link number distribution:
    
      0   1   2   3   4   5   6   7   8   9  10  11  12  14 
      1   2  14  57 125 182 140 122  72  41  12   4   1   1 
    2 least connected regions:
    138 560 with 1 link
    1 most connected region:
    508 with 14 links

    Remarks :

    There is one (1) region, i.e. #86 is without link. It has to be removed first before proceed to plot the neighbours list.

    5.2.2.1 remove 0-neighbour region

    wp_ngaTrans1 <- wp_ngaTrans[-86,]
    cluster_varsTrim1 <- cluster_varsTrim[-86,]
    nga_clust.sf1 <- nga_clust.sf[-86,]
    nga_wpZ1 <- nga_wpZ[-86,]
    nga.sp1 <- as_Spatial(wp_ngaTrans1)
    
    nga.nb1 <- poly2nb(nga.sp1)
    summary(nga.nb1)
    Neighbour list object:
    Number of regions: 773 
    Number of nonzero links: 4440 
    Percentage nonzero weights: 0.7430602 
    Average number of links: 5.743855 
    Link number distribution:
    
      1   2   3   4   5   6   7   8   9  10  11  12  14 
      2  14  57 125 182 140 122  72  41  12   4   1   1 
    2 least connected regions:
    138 560 with 1 link
    1 most connected region:
    508 with 14 links

    5.2.2.2 plot Neighbour List by Centroid Node

    Usage of the code chunk below : plot the boundary first before the neighbour list object to avoid any region from being clipped away.

    plot(nga.sp1, 
         border=grey(.5))
    plot(nga.nb1, 
         coordinates(nga.sp1), 
         col="blue", 
         add=TRUE)

    5.2.3 Compute Minimum Spanning Tree (MST)

    5.2.3.1 calculate edge costs

    Usage of the code chunk below :

    nbcosts( ) - spdep - to compute the cost of each edge which is the distance between nodes.

    edge_cost <- nbcosts(nga.nb1, cluster_varsTrim1)

    5.2.3.2 specify spatial weight

    nb2listw( ) - spdep - to specify edge_cost as the spatial weights. Set the “style” to “B” to ensure the cost values are not row-standardised.

    nga.w <- nb2listw(nga.nb1,
                      edge_cost,
                      style = "B")
    Warning in nb2listw(nga.nb1, edge_cost, style = "B"): zero sum general weights
    summary(nga.w)
    Characteristics of weights list object:
    Neighbour list object:
    Number of regions: 773 
    Number of nonzero links: 4440 
    Percentage nonzero weights: 0.7430602 
    Average number of links: 5.743855 
    Link number distribution:
    
      1   2   3   4   5   6   7   8   9  10  11  12  14 
      2  14  57 125 182 140 122  72  41  12   4   1   1 
    2 least connected regions:
    138 560 with 1 link
    1 most connected region:
    508 with 14 links
    
    Weights style: B 
    Weights constants summary:
        n     nn       S0       S1        S2
    B 773 597529 245120.7 38463020 406298492

    5.2.3.3 compute minimum spanning tree

    Usage of the code chunk below :

    nbcosts( ) - spdep - to compute the minimum spanning tree.

    nga_minSpanT <- mstree(nga.w)

    -- review class and dimension of the computed MST

    class(nga_minSpanT)
    [1] "mst"    "matrix"
    dim(nga_minSpanT)
    [1] 772   3
    head(nga_minSpanT)
         [,1] [,2]      [,3]
    [1,]  474  387 134.66287
    [2,]  387  478  64.55618
    [3,]  387  439  78.95255
    [4,]  439  476  50.91751
    [5,]  439  270 105.68114
    [6,]  270   90  66.67241

    5.2.3.4 plot MST Neighbour List

    plot(nga.sp1, border=gray(.5))
    plot.mst(nga_minSpanT,
             coordinates(nga.sp1), 
             col="blue", 
             cex.lab=0.7, 
             cex.circles=0.005, 
             add=TRUE)

    5.2.4 Compute Spatially Constrained Cluster

    Usage of the code chunk below :

    skater( ) - spdep - to compute the spatially constrained cluster.

    clust5 <- spdep::skater(edges = nga_minSpanT[,1:2],
                            data = cluster_varsTrim1,
                            method = "euclidean",
                            ncuts = 4)
    str(clust5)
    List of 8
     $ groups      : num [1:773] 3 3 1 5 4 1 2 2 1 3 ...
     $ edges.groups:List of 5
      ..$ :List of 3
      .. ..$ node: num [1:309] 773 747 492 131 382 224 413 488 439 257 ...
      .. ..$ edge: num [1:308, 1:3] 131 382 224 413 257 767 439 704 476 75 ...
      .. ..$ ssw : num 17013
      ..$ :List of 3
      .. ..$ node: num [1:129] 597 315 316 557 195 571 339 744 205 213 ...
      .. ..$ edge: num [1:128, 1:3] 315 316 557 195 571 15 82 579 744 351 ...
      .. ..$ ssw : num 7874
      ..$ :List of 3
      .. ..$ node: num [1:85] 364 10 729 215 337 551 102 103 66 19 ...
      .. ..$ edge: num [1:84, 1:3] 23 536 578 103 19 375 727 617 188 103 ...
      .. ..$ ssw : num 4545
      ..$ :List of 3
      .. ..$ node: num [1:39] 550 202 330 287 374 732 537 586 733 201 ...
      .. ..$ edge: num [1:38, 1:3] 612 586 136 245 332 429 504 537 586 616 ...
      .. ..$ ssw : num 1294
      ..$ :List of 3
      .. ..$ node: num [1:211] 67 510 401 122 24 526 475 489 663 303 ...
      .. ..$ edge: num [1:210, 1:3] 67 549 510 119 639 401 556 122 693 70 ...
      .. ..$ ssw : num 10380
     $ not.prune   : NULL
     $ candidates  : int [1:5] 1 2 3 4 5
     $ ssto        : num 52660
     $ ssw         : num [1:5] 52660 48724 44268 42679 41106
     $ crit        : num [1:2] 1 Inf
     $ vec.crit    : num [1:773] 1 1 1 1 1 1 1 1 1 1 ...
     - attr(*, "class")= chr "skater"

    5.2.4.1 tabulate cluster assignment

    ccs5 <- clust5$groups
    table(ccs5)
    ccs5
      1   2   3   4   5 
    309 129  85  39 211 

    5.2.4.2 plot the pruned tree

    plot(nga.sp1, border=gray(.5))
    plot(clust5, 
         coordinates(nga.sp1), 
         cex.lab=.7,
         groups.colors=c("red","green","blue", "brown", "pink"),
         cex.circles=0.005, 
         add=TRUE)
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter

    5.2.5 Visualise SKATER Clusters in Choropleth Map

    groups_mat <- as.matrix(clust5$groups)
    
    nga_spClust.sf <- cbind(nga_clust.sf1, as.factor(groups_mat)) %>%
      rename(`sp_cluster`=`as.factor.groups_mat.`)

    To compare the output of hierarchical clustering and spatially constrained hierarchical clustering :

    hieClust_map <- qtm(nga_clust.sf1,
                      "cluster") + 
      tm_borders(alpha = 0.5) 
    
    ngaClust_map <- qtm(nga_spClust.sf,
                       "sp_cluster") + 
      tm_borders(alpha = 0.5) 
    
    tmap_arrange(hieClust_map, ngaClust_map,
                 asp=NA, ncol=2)
    Warning: One tm layer group has duplicated layer types, which are omitted. To
    draw multiple layers of the same type, use multiple layer groups (i.e. specify
    tm_shape prior to each of them).
    
    Warning: One tm layer group has duplicated layer types, which are omitted. To
    draw multiple layers of the same type, use multiple layer groups (i.e. specify
    tm_shape prior to each of them).

    5.3 Spatially Constrained Clustering :: ClustGeo Method

    5.3.1 Perform Ward-like Hierarchical Clustering

    Usage of the code chunk below :

    hclustgeo( ) - ClustGeo - to perform a typical Ward-like hierarchical clustering.

    proxmat_ngc <- dist(cluster_varsTrim1, method = 'euclidean')
    nonGeo_clust <- hclustgeo(proxmat_ngc)
    plot(nonGeo_clust, cex = 0.5)
    rect.hclust(nonGeo_clust, 
                k = 5, 
                border = 2:5)

    5.3.1.1 visualise the formed clusters

    groups_ngc <- as.factor(cutree(nonGeo_clust, k=5))
    nga_ngeo_clust.sf <- cbind(wp_ngaTrans1, as.matrix(groups_ngc)) %>%
      rename(`cluster` = `as.matrix.groups_ngc.`)
    qtm(nga_ngeo_clust.sf, "cluster")

    5.3.2 Perform Spatially Constrained Hierarchical Clustering

    Usage of the code chunk below :

    st_distance( ) - sf - to derive the spatial distance matrix before perform spatially constrained hierarchical clustering.

    as.dist( ) - stats - to convert the data frame into matrix.

    dist <- st_distance(wp_ngaTrans1, wp_ngaTrans1)
    dist_mat <- as.dist(dist)

    5.3.2.1 determine alpha value

    choicealpha( ) - psych - to determine a suitable value for the mixing parameter alpha.

    cr <- choicealpha(
      proxmat_ngc, 
      dist_mat, 
      range.alpha = seq(0, 1, 0.1), 
      K=5, 
      graph = TRUE)

    Remarks :

    With reference to the plot above, alpha = 0.4 to be used to perform spatially constrained hierarchical clustering.

    5.3.2.2 compute spatially constrained hierarchical clustering

    clustG <- hclustgeo(proxmat_ngc, 
                        dist_mat, 
                        alpha = 0.4)

    5.3.2.3 derive cluster object

    groups_cg <- as.factor(cutree(clustG, k=5))

    5.3.2.4 combine group_cg with wp_ngaTrans1

    wp_nga1_GClust <- cbind(wp_ngaTrans1, as.matrix(groups_cg)) %>%
      rename(`cluster` = `as.matrix.groups_cg.`)

    5.3.2.5 plot delineated spatially constrained cluster

    qtm(wp_nga1_GClust, "cluster")


    5.4 Visual Interpretation of Clusters

    5.4.1 Visualise Individual Clustering Variable

    5.4.1.1 plot boxplot

    ggplot(data = nga_ngeo_clust.sf,
           aes(x = cluster, y = pct_functional)) +
      geom_boxplot()

    Remarks :

    The boxplot reveals Cluster 5 displays the highest mean of functional water points. This is followed by Cluster 1, 3, 2, and 4.

    5.4.2 Visualise Multivariate

    Usage of the code chunk below :

    ggparcoord( ) - GGally - to reveal clustering variables by cluster.

    nga_ngeo_clust.sf1 <- nga_ngeo_clust.sf %>%
      select("shapeName", 
             "pct_functional", 
             "pct_nonFunctional", 
             "pct_unknown", 
             "pct_handPump", 
             "pct_tapStand", 
             "pct_uc300", 
             "pct_uc1000", 
             "pct_uc250", 
             "pct_urban0",
             "cluster")
             
    head(nga_ngeo_clust.sf1,3)
    Simple feature collection with 3 features and 11 fields
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: 7.307433 ymin: 5.052192 xmax: 13.83477 ymax: 13.71406
    Projected CRS: Minna / Nigeria West Belt
      shapeName pct_functional pct_nonFunctional pct_unknown pct_handPump
    1 Aba North       41.17647          52.94118    5.882353    11.764706
    2 Aba South       40.84507          46.47887    9.859155     9.859155
    3    Abadam        0.00000           0.00000    0.000000     0.000000
      pct_tapStand pct_uc300 pct_uc1000 pct_uc250 pct_urban0 cluster
    1            0  17.64706   82.35294         0   0.000000       1
    2            0  12.67606   87.32394         0   5.633803       1
    3            0   0.00000    0.00000         0   0.000000       1
                            geometry
    1 MULTIPOLYGON (((7.401109 5....
    2 MULTIPOLYGON (((7.334479 5....
    3 MULTIPOLYGON (((13.83477 13...
    ggparcoord(data = nga_ngeo_clust.sf,
               columns = c(2:19),
               scale = "globalminmax",
               alphaLines = 0.2,
               boxplot = TRUE, 
               title = "Multiple Parallel Coordinates Plots of Variables by Cluster") +
      facet_grid(~ cluster, scales = "fixed") + 
      theme(axis.text.x = element_text(angle = 30))

    The parallel coordinate plot above reveals that households in Cluster 4 townships tend to own the highest number of TV and mobile-phone. On the other hand, households in Cluster 5 tends to own the lowest of all the five ICT.

    Note that the scale argument of ggparcoor() provide several methods to scale the clustering variables. They are:

    • std: univariately, subtract mean and divide by standard deviation.

    • robust: univariately, subtract median and divide by median absolute deviation.

    • uniminmax: univariately, scale so the minimum of the variable is zero, and the maximum is one.

    • globalminmax: no scaling is done; the range of the graphs is defined by the global minimum and the global maximum.

    • center: use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.

    • centerObs: use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param

    5.4.3 Compute Summary Statistics

    nga_ngeo_clust.sf %>% 
      st_set_geometry(NULL) %>%
      group_by(cluster) %>%
      summarise(mean_pct_functional = mean(pct_functional),
                mean_pct_nonFunctional = mean(pct_nonFunctional),
                mean_pct_unknown = mean(pct_unknown),
                mean_pct_handPump = mean(pct_handPump), 
                mean_pct_tapStand = mean(pct_tapStand), 
                mean_pct_uc300 = mean(pct_uc300), 
                mean_pct_uc1000 = mean(pct_uc1000), 
                mean_pct_uc250 = mean(pct_uc250), 
                mean_pct_urban0 = mean(pct_urban0))
    # A tibble: 5 × 10
      cluster mean_pct_fun…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵ mean_…⁶ mean_…⁷ mean_…⁸
      <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
    1 1                 51.7    29.0   9.25    35.0   0.525     44.1    45.5  0.539 
    2 2                 46.2    42.2  11.3     59.9   1.00      70.6    28.3  1.03  
    3 3                 40.5    49.4   9.15     9.87  0.304     19.2    80.4  0.345 
    4 4                 18.4    14.6  67.0     18.1   0.337     72.1    27.5  0.410 
    5 5                 78.9    20.7   0.295   88.7   0.0677    89.2    10.7  0.0903
    # … with 1 more variable: mean_pct_urban0 <dbl>, and abbreviated variable names
    #   ¹​mean_pct_functional, ²​mean_pct_nonFunctional, ³​mean_pct_unknown,
    #   ⁴​mean_pct_handPump, ⁵​mean_pct_tapStand, ⁶​mean_pct_uc300, ⁷​mean_pct_uc1000,
    #   ⁸​mean_pct_uc250